1. INTRODUCTION

In today's digital world, advances in technology are causing data volumes to grow rapidly. Data generated
on the fly is called streaming data. Analyzing streaming data with machine learning techniques faces many
challenges, such as high data velocity, high data volume, and changes in the underlying distribution over
time. For example, the properties of malicious uniform resource locators (URLs), fraudulent transactions,
and spam tweets posted by spammers change continuously [1], [2]. When classifying instances from a data
stream, the model holds a hypothesis that maps the feature variables (X) to the target variables (Y), which
are the labels of the instances. A change in the underlying distribution is known as concept drift, and
adaptive machine learning models are needed that can adjust themselves to the new distribution. Concept
drift is mainly categorized as virtual or real drift.

Virtual drift: There is a change in the distribution of the features p(x) of the instances, or a change in
the distribution of the concepts or target variables p(y). Real drift: The relationship between the input
variables and the target concept changes. This is a change in the likelihood p(x|y), so that
p_t(x|y) ≠ p_{t+1}(x|y), or a change in the posterior probability distribution p(y|x), so that
p_t(y|x) ≠ p_{t+1}(y|x). Real drift affects the decision boundary. In many real-world problems, both kinds
of drift occur at the same time. Dealing with concept drift in data stream classification has become
challenging and has attracted attention in the research community.
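The distinction between the two drift types can be illustrated with a minimal Python sketch. It assumes discrete feature values and two windows of labeled samples, and it only checks p(x) for virtual drift (not p(y)); the function names and the exact-comparison tolerance are illustrative choices, not part of the paper.

```python
from collections import Counter


def marginal_x(samples):
    """Estimate p(x) from a list of (x, y) pairs."""
    counts = Counter(x for x, _ in samples)
    return {x: c / len(samples) for x, c in counts.items()}


def posterior_y_given_x(samples):
    """Estimate p(y | x) from a list of (x, y) pairs."""
    joint = Counter(samples)
    x_counts = Counter(x for x, _ in samples)
    return {(x, y): c / x_counts[x] for (x, y), c in joint.items()}


def differs(a, b, tol=1e-9):
    """True if two estimated distributions differ on any key."""
    return any(abs(a.get(k, 0.0) - b.get(k, 0.0)) > tol for k in set(a) | set(b))


def drift_flags(window_t, window_t1):
    """Return (virtual, real): p(x) changed, p(y|x) changed."""
    virtual = differs(marginal_x(window_t), marginal_x(window_t1))
    real = differs(posterior_y_given_x(window_t), posterior_y_given_x(window_t1))
    return virtual, real
```

On real streams the distributions are estimated from finite windows, so the exact comparison used here would normally be replaced by a statistical test.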

Earlier approaches used statistical change detection tests to monitor concept drift. Instead of accuracy,
the drift detection method for online class imbalance (DDM-OCI) [3] monitors the class recall to deal with
the imbalance issue. To detect drift on the positive and negative classes, linear four rates [4] uses the
true positive rate, true negative rate, positive predictive value, and negative predictive value.
Page-Hinkley (PAUC-PH) [5] uses the PH test [6] to detect drift. Another important issue that affects a
classification model's performance is class imbalance, where the instances of one class dominate the
others. When imbalance and concept drift occur at the same time in a data stream, they tend to exacerbate
each other: under class imbalance, it becomes difficult to detect the concept drift and adapt the model to
the new distribution. Active and passive approaches have been used to handle concept drift. The active
approach involves an explicit detection technique, while the passive approach is based on adaptation of
the model. The passive approach has been more successful than the active one because it overcomes the
limitations of the active approach. Class imbalance in stationary or static environments is a widely
addressed problem with various techniques, but only a few models deal with concept drift and class
imbalance simultaneously. These models are categorized as online and chunk-based models. Chunk-based
models mostly use an ensemble learning approach. Online learning models [6], [7] adapt themselves for
every instance arriving in the stream and are more effective in handling abrupt drift. In chunk-based
learning, the model does not adapt until a certain number of instances have been collected in a buffer,
whose size is mostly decided in advance and fixed throughout the analysis of the data stream. Some
chunk-based methods assign dynamic weights to the component classifiers of the ensemble model based on
an accuracy measure [8].
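As an illustration of an active (explicit) detector, the following Python sketch implements the standard Page-Hinkley test on a stream of monitored values, such as a per-instance error indicator. The parameter values `delta` and `threshold` are hypothetical defaults, and the sketch is not the PAUC-PH method itself, which monitors a different statistic.

```python
class PageHinkley:
    """Minimal Page-Hinkley change detector for an increase in the mean."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm threshold (lambda)
        self.n = 0                  # number of values seen
        self.mean = 0.0             # running mean of the values
        self.cum = 0.0              # cumulative deviation m_t
        self.cum_min = 0.0          # minimum of m_t seen so far

    def update(self, value):
        """Feed one value; return True if a drift alarm is raised."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        self.cum += value - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.threshold
```

A detector of this kind is typically reset after an alarm so that the statistics are re-estimated on the new concept; the reset is omitted here for brevity.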

Several fixed-size chunk-based methods have been proposed for classifying imbalanced non-stationary data
streams [9], [10]. In uncorrelated bagging [11], the current chunk is balanced by preserving the minority
class examples from previous chunks. Its limitations are the memory needed to store past data instances
and its inability to adapt rapidly to a new concept. The selectively recursive approach (SERA) [12] and
the recursive ensemble approach (REA) [13] improve on this technique by selecting only the most similar
past minority instances. Ditzler and Polikar [14] proposed two chunk-based ensembles called Learn++.CDS
(concept drift with SMOTE) and Learn++.NIE (non-stationary and imbalanced environments). Both are
inspired by Learn++.NSE and handle imbalanced data streams with concept drift [15]; Learn++.NSE deals
with concept drift using a dynamic weighting strategy, and SMOTE is used to balance the minority class
instances. The ensemble of subset online sequential extreme learning machine (ESOS-ELM) [16] constructs
and stores weight matrices for every chunk. The gradual resampling ensemble (GRE) [17] uses a clustering
technique to select the minority class samples from the previous chunk. To generate the training dataset,
it applies density-based spatial clustering of applications with noise (DBSCAN) to the minority class and
tries to minimize overlap with the majority class.

A few chunk-based methods also preserve the minority samples from the previous chunk and merge them with
the minority samples in the succeeding chunk to obtain enough minority samples. However, the assumption
behind this may fail, as the imbalance ratio may not be fixed and may change over time. The review shows
that bagging-based ensembles are useful for improving classifier performance under imbalance. The
proposed method is based on the bagging approach and is compared with the following state-of-the-art
bagging methods: i) over bagging [18], which relies on random oversampling of the minority class to build
each subset of the dataset, so that every subset includes all the original examples plus duplicates of
randomly selected minority class instances; ii) synthetic minority oversampling technique (SMOTE)
bagging [18], which uses the SMOTE algorithm to create new minority class instances, with majority class
instances selected randomly to increase the diversity of the subsets; iii) under bagging [19], which uses
undersampling instead of oversampling to generate subsets from the original dataset, so that the size of
each subset is reduced; and iv) under over bagging [19], which uses both undersampling and oversampling
along with SMOTE bagging.
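The difference between the over bagging and under bagging resampling steps can be sketched in Python as follows. This is a simplified illustration that builds a single bag from a list of (features, label) pairs; the helper names are hypothetical, and the published methods include details (such as varying resampling rates across bags) that are not reproduced here.

```python
import random
from collections import defaultdict


def group_by_class(samples):
    """Group (x, y) pairs by their label y."""
    groups = defaultdict(list)
    for x, y in samples:
        groups[y].append((x, y))
    return groups


def over_bag(samples, rng):
    """Keep all originals and duplicate random minority instances up to the largest class size."""
    groups = group_by_class(samples)
    target = max(len(members) for members in groups.values())
    bag = []
    for members in groups.values():
        bag.extend(members)
        bag.extend(rng.choices(members, k=target - len(members)))
    return bag


def under_bag(samples, rng):
    """Randomly undersample every class down to the smallest class size."""
    groups = group_by_class(samples)
    target = min(len(members) for members in groups.values())
    bag = []
    for members in groups.values():
        bag.extend(rng.sample(members, target))
    return bag
```

Calling either function repeatedly with the same `random.Random` instance yields different bags, which is the source of diversity among the component classifiers.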

Most of the methods reviewed were designed for two-class classification in data streams, with one
minority class and one majority class. These methods fail to handle multiple minority classes and the
dynamic imbalance ratio in multi-class data streams. In addition, the chosen chunk size affects the
performance of a model, especially when the data stream is imbalanced. In this paper, we propose a hybrid
dynamic chunk ensemble model (HDCEM) for the classification of multi-class imbalanced insect data
streams. In the proposed ensemble model, a decision tree is used as the candidate classifier, and dynamic
ensemble selection is used to classify test data. The proposed model has the following advantages: i) it
can perform multi-class classification in non-stationary data streams; ii) the imbalance issue is
resolved using a novel split-based resampling ensemble algorithm; iii) it can handle abrupt and gradual
concept drifts, as features of both online and chunk-based learning are combined; and iv) dynamic
ensemble selection is applied to test data.
2. RESEARCH METHOD

2.1. Dataset 

The proposed model has been evaluated using the mosquito insect stream dataset released by the authors
of [20]. This dataset has 33 features and six class labels. The class labels are three species of
mosquitoes, each with both sexes. Details of these mosquito species are given below.

- Aedes aegypti. This mosquito species is commonly known as the yellow fever mosquito. It is involved in
spreading dengue fever, Zika fever, chikungunya, Mayaro, yellow fever viruses, and other disease agents
[21].

- Aedes albopictus. This mosquito species is called the Asian tiger mosquito or forest mosquito. It can
spread diseases including yellow fever, dengue fever, and chikungunya fever [21].

- Culex quinquefasciatus. This is known as the southern house mosquito. It is a medium-sized mosquito
found in tropical and subtropical regions of the world. It is important in the transmission of Wuchereria
bancrofti, avian malaria, and arboviruses including St. Louis encephalitis virus [22].

The features of this dataset are extracted by processing the optical signal with signal processing
techniques. These features include the wing beat frequency, various statistics from the temporal
representation, complexity measures of the signal spectrum, and so on. There are three variations of this
dataset, with abrupt, gradual, and recurring concept drifts. The authors generated these insect stream
datasets using an optical-sensor-based smart trap for catching the insects. The different concept drifts
are produced by varying the temperature, which may affect the distribution of the features of the
insects.


2.2. Description of proposed model 

In this section we describe the proposed model with pseudo-code. The proposed model is shown in
Figure 1. Consider a data stream S = {xi, yi}, where xi is an input feature vector and
yi ∈ {c1, c2, ..., cm} is the output variable or class label of xi. The overall model is described in the
pseudo-code in Figure 2. In the first phase, the model is trained on dynamically sized chunks. Instead of
fixing the chunk size in advance, chunks of dynamic size are formed by collecting a sufficient number of
instances from all classes of the dataset (lines 5-6). These chunks are used to train the ensemble model.
Stability of the model is achieved by monitoring the error rate and applying a statistical test to the
variances in prediction error [23] (lines 7-14). Here, the hybrid ensemble model is trained on the whole
chunk while, at the same time, one special competent classifier is trained on every instance from the
chunk, so that abrupt or gradual drift, if present, can be handled effectively. To handle imbalance
within chunks, a split-based resampling algorithm is used, which is described in the pseudo-code in
Figure 3. For test data, k-nearest neighbors (KNN) based dynamic ensemble selection is used
(lines 14-21).
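The dynamic chunk formation step can be sketched as follows. This is a minimal Python illustration under the assumption that "enough" means a fixed minimum count per class; the parameter `min_per_class` and the function name are hypothetical, since the paper does not state the exact criterion.

```python
from collections import Counter


def form_dynamic_chunk(stream, classes, min_per_class):
    """Read (x, y) pairs from an iterator until every class has min_per_class samples.

    The chunk size is therefore not fixed in advance: it depends on how
    quickly the rarest class appears in the stream.
    """
    chunk = []
    counts = Counter()
    for x, y in stream:
        chunk.append((x, y))
        counts[y] += 1
        if all(counts[c] >= min_per_class for c in classes):
            break
    return chunk
```

Because `stream` is consumed as an iterator, calling the function again continues from where the previous chunk ended. If the stream is exhausted first, the returned chunk may be incomplete.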


[Figure 1 flowchart. Recoverable steps: read samples from the insect data stream; form a chunk from the
data samples with enough samples of every class; train the base classifiers of the ensemble model using
the current chunk; test the ensemble model using test data and find the variance in prediction; extend
the current chunk by adding a greater number of instances; retrain the ensemble model using split
bagging; perform dynamic ensemble selection; classify the given test data using the ensemble classifiers.
In parallel: train an online classifier using every sample from the chunk, and combine the online
classifier with the ensemble model.]


Figure 1. Framework for proposed model 


To deal with the imbalance issue in the data stream, we have implemented a split bagging technique,
which is presented in the pseudo-code in Figure 3. In a multi-class imbalanced dataset, along with the
dynamic imbalance ratio of the classes, there are other data difficulty factors, such as noisy minority
instances and overlapping classes, which deteriorate the performance of classifiers even when the dataset
has been balanced. In addition, the diversity of the training subsets is crucial for improving classifier
performance. The proposed algorithm systematically creates a number of subsets by splitting the classes
into partitions based on the size of the minority class. Every partition is filled with an equal number
of majority class instances, which are selected using the negative binomial distribution [24] given
in (1) (lines 1-13), together with synthetic minority examples.


p(m \mid n) = \binom{m + n - 1}{m} p^{n} q^{m}   (1)


where m is the number of failures for a given number of successes n, and p = q = 0.5. Instead of being
generated randomly, synthetic minority instances are generated only from the safe minority samples, using
the nearest neighbors for every class. As a result, the generated synthetic examples do not contain any
noisy or hard-to-learn instances (lines 14-15). The classifiers in the ensemble model are then trained on
the balanced subsets (lines 16-18).
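For reference, the negative binomial probability used in (1) can be evaluated with a few lines of Python. The sketch assumes the standard form of the distribution, in which m failures are observed before the n-th success; how the resulting probabilities are mapped to the choice of majority instances is not specified in enough detail here to reproduce, so only the probability itself is shown.

```python
from math import comb


def neg_binomial_pmf(m, n, p=0.5):
    """Probability of m failures before the n-th success, with success probability p."""
    q = 1.0 - p
    return comb(m + n - 1, m) * p**n * q**m
```

With p = q = 0.5, as stated in the text, `neg_binomial_pmf(1, 2)` evaluates to 2 × 0.25 × 0.5 = 0.25.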


1: Input: S: {xi, yi} imbalanced data stream, yi ∈ {c1, c2, ..., cp}; N: number of
   component classifiers; Dk: data chunk with dynamic size; Dt: test data chunk for
   building the ensemble model; Mt: initial ensemble model; Dl: extended data chunk;
   chval: value returned by the statistical test; th: predefined threshold value; Vt:
   variance in prediction of test instances; Vl: variance in prediction of test
   instances when the ensemble classifier is trained on the extended chunk; Dtest: test
   data for assessing the performance of the ensemble model; Ch: ensemble classifier
   with N component classifiers.
2: Mt = ∅
3: Output: Ensemble model Ch
4: for all xi in S
5:    Dk = {x1, x2, ..., xt}   # collect samples from the stream with enough samples of each class
6:    Dt = {select a random number of samples as test data}
7:    Mt, Dk, Ch = SplitBagging(Dk, N)
8:    Ch = Ch ∪ Mt
9:    Vt = CalculateVariance(Mt, Dt)
10:   Dl = {xt, ..., xtd}   # add more samples to the previous chunk; this is the extended chunk
11:   Ml, Dl, Ch = SplitBagging(Dl, N)
12:   Vl = CalculateVariance(Ml, Dt)
13:   chval = checkStability(Vt, Vl)
14:   if chval > th
15:      Cl = SplitBagging(Dl)
16:      Ch = Ch ∪ Cl
17:      for i <- 1 to |Dt|
18:         Yp <- Knn-DES(Ch, Dl, Dt)   # dynamic ensemble selection
19:      end for
20:   end if
21: end for

Figure 2. Pseudo code for HDCEM 

1: Input: Dt: {xi, yi} imbalanced data chunk; C = {c1, c2, ..., ck}; k: number of classes;
   L: number of candidate classifiers in the ensemble model; Qmaj: size of the majority
   class; Qmin: size of the minority class; p: number of classifiers in the ensemble
2: Output: bs = balanced data, Ch = trained ensemble model
3: for i = 1 to p-1
4:    csi = |size of class ci|
5:    if csi mod Qmin = 0
6:       Npci = csi / Qmin
7:    else
8:       Npci = csi / Qmin + 1
9: for j = 1 to k-1
10:   for l = 1 to Npcj
11:      Select Qmin nearest neighbors of the minority class from class j using the
         negative binomial distribution in (1) and assign them to partition part_lj
12:      Add part_lj for class j to bs
13:   end for
14:   Select k nearest neighbors from minority class Qmin for class j, take only the
      safe minority class examples, generate samples using SMOTE, and add them to bs
15: end for
16: for j = 1 to L
17:    Select a partition from bs for every class, create a subset, and add it to Btr
       Cl <- L(Btr)
18:    Ch <- Ch ∪ Cl
19: Return ensemble model Ch, Btr


Figure 3. Pseudo code for split bagging 
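The idea of generating synthetic examples only from safe minority samples can be sketched in Python as follows. The sketch assumes one common definition of "safe" (at least `min_same` of the `k` nearest neighbors share the minority label) and SMOTE-style linear interpolation; both the threshold values and the helper names are assumptions for illustration, not the paper's exact procedure.

```python
import math
import random


def k_nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding itself."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def safe_minority_indices(points, labels, minority, k=5, min_same=4):
    """Minority samples whose neighborhood is dominated by their own class."""
    safe = []
    for i, label in enumerate(labels):
        if label != minority:
            continue
        neighbours = k_nearest(i, points, k)
        same = sum(1 for j in neighbours if labels[j] == minority)
        if same >= min_same:
            safe.append(i)
    return safe


def interpolate(a, b, rng):
    """SMOTE-style synthetic point on the segment between a and b."""
    gap = rng.random()
    return tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b))
```

A synthetic example would then be created by calling `interpolate` on a safe minority sample and one of its minority neighbors, so that isolated or overlapping minority points never seed new data.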

The proposed model uses dynamic ensemble selection to classify test data. Pseudo-code for KNN-dynamic
ensemble selection (KNN-DES) is shown in Figure 4. For every instance in the test data, a region of
competence is computed from its k nearest neighbors in the training dataset. Only the classifiers
associated with the selected k nearest neighbors are used to decide the label of the test instance, by
maximizing the predicted probability of the respective class label.


1: Input: Btr: balanced training chunk, Bte: test data, Ch: ensemble model
2: Output: Output label for every test instance xj ∈ Bte
3: for every instance xj ∈ Bte
4:    find Xt = N k-nearest neighbors from training chunk Btr for instance xj
5:    for t = 1 to N
6:       wt = 1/dt   # dt is the Euclidean distance between xj and xt, xt ∈ Xt
7:    end for
8:    Normalize weight wt = wt / Σ_{t=1}^{N} wt
9:    for each cl ∈ Ch
         yj = C(xj) = argmax_k ( Σ_{l=1}^{L} Pr(yk | xj ∈ Btr, cl) * wt )
      end for
10: Return yj


Figure 4. Pseudo code for KNN-DES 
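One plausible reading of the KNN-based selection step is sketched below in Python. It assumes that classifiers are plain callables returning a label, that competence is the inverse-distance-weighted accuracy inside the region of competence, and that only the most competent classifiers vote; these are illustrative assumptions rather than the exact procedure of Figure 4.

```python
import math
from collections import defaultdict


def knn_des_predict(x, train_X, train_y, classifiers, n_neighbors=3, eps=1e-12):
    """Label x using only the classifiers that are most competent near x."""
    # Region of competence: the n_neighbors training points closest to x.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))
    region = order[:n_neighbors]
    # Inverse-distance weights, normalized to sum to 1 (eps avoids division by zero).
    raw = [1.0 / (math.dist(x, train_X[i]) + eps) for i in region]
    total = sum(raw)
    weights = [w / total for w in raw]
    # Competence of each classifier: weighted accuracy inside the region.
    competence = [
        sum(w for w, i in zip(weights, region) if clf(train_X[i]) == train_y[i])
        for clf in classifiers
    ]
    best = max(competence)
    selected = [clf for clf, c in zip(classifiers, competence) if c == best]
    # The selected classifiers vote on the label of x.
    votes = defaultdict(float)
    for clf in selected:
        votes[clf(x)] += 1.0
    return max(votes, key=votes.get)
```

The design choice here is that classifiers which misclassify the neighbors of a test instance are excluded from the vote for that instance, which is the core idea of dynamic ensemble selection.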


3. RESULTS AND DISCUSSION 

In this section, we present a comparative analysis of the proposed HDCEM-split bagging against
HDCEM-SMOTE bagging, HDCEM-over bagging, and HDCEM-under over bagging. We intend to verify the potency
of the proposed split bagging in multi-class imbalanced data streams. When data is imbalanced, the
F-measure, precision, and minority class recall are used to assess model performance rather than
accuracy. The precision, also called the positive predictive value, is the ratio of correctly predicted
positive instances to the total predicted positive instances. The recall, also called the sensitivity or
true positive rate, is the ratio of true positive instances to actual positive instances. The F1 score,
or F-measure, is the harmonic mean of precision and recall. These performance measures are defined
in (2), (3), and (4).


Precision = \frac{TP}{TP + FP}   (2)

Recall = \frac{TP}{TP + FN}   (3)

F\text{-}Measure = \frac{2 \times (Precision \times Recall)}{Precision + Recall}   (4)
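The three measures can be computed per class from the true and predicted labels, treating one class as positive and all others as negative. The following Python sketch uses the standard definitions; the function name and the convention of returning 0.0 when a denominator is zero are choices made for this illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F-measure for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For minority class recall in a multi-class stream, the function would be called once per minority class with that class as `positive`.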


3.1. Experimental results 

The F-measure and minority class recall for the imbalanced insect data streams with abrupt and gradual
concept drifts at different chunk sizes are shown in Figure 5. Figures 5(a) and 5(b) show that the
proposed HDCEM with split bagging gives stable performance for both the abrupt and gradual drift insect
data streams and outperforms SMOTE bagging, over bagging, and under over bagging for different chunk
sizes. The minority class recall of the proposed model is better than that of the other techniques.

The average minority class recall is 78% in the abrupt drift insect data stream and 71% in the gradual
drift insect data stream. In abrupt drift, the accuracy is higher because one dedicated component
classifier learns every instance from the current chunk. For the gradual drift insect data stream,
although the chunk size is larger than for the abrupt drift data stream, the minority class recall has
not improved.

The reason may be an increased number of hard-to-learn instances or an increased imbalance ratio. The
average overall accuracy achieved is 91%, but due to limited space the graphical analysis is not shown
here. To support this result, we applied the non-parametric Mann-Whitney U statistical test [25]. The
results of this test, in the form of average rank values (R+) and (R-) between the proposed model with
split bagging and the other bagging techniques, are shown in Table 1. Table 1 shows that the proposed
model has outperformed the other bagging techniques.
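For readers unfamiliar with the test, the Mann-Whitney U statistic for two samples of scores can be computed directly from pairwise comparisons, as in the Python sketch below. The function name is hypothetical, and the sketch returns only the two U statistics, not the rank values or p-values reported in Table 1.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b): pairwise wins of each sample, with ties counted as 0.5."""
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b
```

When every score in the first sample exceeds every score in the second, U_a equals the product of the sample sizes and U_b is zero, which is the strongest possible evidence that the first method ranks higher.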

Figure 5. Comparative analysis using F-measure and minority class recall for (a) insect data stream with
abrupt drift and (b) insect data stream with gradual drift 


Table 1. Mann-Whitney’s U statistical test results 

Figure 6. Computational time for chunk processing


4. CONCLUSION 

Concept drift and class imbalance are major problems in the classification of data streams. In this
paper we have proposed HDCEM for multi-class data streams to deal with these issues. HDCEM generates an
ensemble model that is trained on data chunks whose size is decided dynamically rather than fixed a
priori. To handle the dynamic imbalance issue, we have proposed a split-based bagging algorithm that can
handle the noisy and hard-to-learn minority and majority instances present in the dataset. In addition,
instead of applying a direct majority voting ensemble algorithm for test data prediction, k-nearest
neighbor based dynamic ensemble selection is used. Experimental results show that the proposed model
outperforms the compared bagging techniques, but it is computationally expensive. Processing multiple
classes in a data stream takes more time, so in future work the proposed model can be implemented on
distributed platforms such as Hadoop or Spark.